AITopics

Country: North America > United States (0.93)

Genre: Research Report (0.92)

Industry: Information Technology > Security & Privacy (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
Information Technology > Data Science (0.67)

Neural Information Processing SystemsFeb-8-2026, 07:15:56 GMT

1687466683649e8bdcdec0e3f5c8de64-Paper-Conference.pdf

gaussian, learner, learning, (13 more...)

Country:

North America > United States > District of Columbia > Washington (0.04)
North America > United States > Utah > Salt Lake County > Salt Lake City (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
(3 more...)

Genre: Research Report (0.92)

Industry: Information Technology > Security & Privacy (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
Information Technology > Data Science (0.67)

Neural Information Processing SystemsDec-24-2025, 00:54:05 GMT

Private Distribution Learning with Public Data: The View from Sample Compression

We study the problem of private distribution learning with access to public data. In this setup, which we refer to as *public-private learning*, the learner is given public and private samples drawn from an unknown distribution $p$ belonging to a class $\mathcal Q$, with the goal of outputting an estimate of $p$ while adhering to privacy constraints (here, pure differential privacy) only with respect to the private samples. We show that the public-private learnability of a class $\mathcal Q$ is connected to the existence of a sample compression scheme for $\mathcal Q$, as well as to an intermediate notion we refer to as \emph{list learning}. Leveraging this connection: (1) approximately recovers previous results on Gaussians over $\mathbb R^d$; and (2) leads to new ones, including sample complexity upper bounds for arbitrary $k$-mixtures of Gaussians over $\mathbb R^d$, results for agnostic and distribution-shift resistant learners, as well as closure properties for public-private learnability under taking mixtures and products of distributions. Finally, via the connection to list learning, we show that for Gaussians in $\mathbb R^d$, at least $d$ public samples are necessary for private learnability, which is close to the known upper bound of $d+1$ public samples.

name change, private distribution learning, public data, (9 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.62)

Setlur, Amrith, Thaker, Pratiksha, Ullman, Jonathan

Lower Bounds for Public-Private Learning under Distribution Shift

arXiv.org Artificial IntelligenceJul-25-2025

The most effective differentially private machine learning algorithms in practice rely on an additional source of purportedly public data. This paradigm is most interesting when the two sources combine to be more than the sum of their parts. However, there are settings such as mean estimation where we have strong lower bounds, showing that when the two data sources have the same distribution, there is no complementary value to combining the two data sources. In this work we extend the known lower bounds for public-private learning to setting where the two data sources exhibit significant distribution shift. Our results apply to both Gaussian mean estimation where the two distributions have different means, and to Gaussian linear regression where the two distributions exhibit parameter shift. We find that when the shift is small (relative to the desired accuracy), either public or private data must be sufficiently abundant to estimate the private parameter. Conversely, when the shift is large, public data provides no benefit.

artificial intelligence, machine learning, priv, (16 more...)

2507.17895

Country: North America > United States (0.28)

Genre: Research Report > New Finding (0.48)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.67)

arXiv.org Artificial IntelligenceJan-31-2025

Contrastive Private Data Synthesis via Weighted Multi-PLM Fusion

Zou, Tianyuan, Liu, Yang, Li, Peng, Xiong, Yufei, Zhang, Jianqing, Liu, Jingjing, Ye, Xiaozhou, Ouyang, Ye, Zhang, Ya-Qin

Substantial quantity and high quality are the golden rules of making a good training dataset with sample privacy protection equally important. Generating synthetic samples that resemble high-quality private data while ensuring Differential Privacy (DP), a formal privacy guarantee, promises scalability and practicality. However, existing methods relying on pre-trained models for data synthesis %that avoid fine-tuning large pre-trained generative models often struggle in data-deficient scenarios, suffering from limited sample size, inevitable generation noise and existing pre-trained model bias. To address these challenges, we propose a novel contrAstive private data Synthesis via Weighted multiple Pre-trained language models (PLM) framework, named as WASP. WASP utilizes limited private samples for more accurate private data distribution estimation via a Top-Q voting mechanism, and leverages low-quality synthetic samples for contrastive generation via collaboration among dynamically weighted multiple pre-trained models.Extensive experiments on 6 well-developed datasets with 6 open-source and 3 closed-source PLMs demonstrate the superiority of WASP in improving model performance over diverse downstream tasks. Code is available at https://anonymous.4open.science/r/WASP.

dataset, plm, private sample, (13 more...)

2502.00245

Country:

Asia > China > Beijing > Beijing (0.04)
Asia > China > Shanghai > Shanghai (0.04)
North America > United States > Oregon > Multnomah County > Portland (0.04)
(4 more...)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Neural Information Processing SystemsOct-10-2024, 00:46:56 GMT

Private Distribution Learning with Public Data: The View from Sample Compression

We study the problem of private distribution learning with access to public data. In this setup, which we refer to as *public-private learning*, the learner is given public and private samples drawn from an unknown distribution p belonging to a class \mathcal Q, with the goal of outputting an estimate of p while adhering to privacy constraints (here, pure differential privacy) only with respect to the private samples. We show that the public-private learnability of a class \mathcal Q is connected to the existence of a sample compression scheme for \mathcal Q, as well as to an intermediate notion we refer to as \emph{list learning}. Leveraging this connection: (1) approximately recovers previous results on Gaussians over \mathbb R d; and (2) leads to new ones, including sample complexity upper bounds for arbitrary k -mixtures of Gaussians over \mathbb R d, results for agnostic and distribution-shift resistant learners, as well as closure properties for public-private learnability under taking mixtures and products of distributions. Finally, via the connection to list learning, we show that for Gaussians in \mathbb R d, at least d public samples are necessary for private learnability, which is close to the known upper bound of d 1 public samples.

private distribution learning, public data, sample compression, (5 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.66)

Ben-David, Shai, Bie, Alex, Canonne, Clément L., Kamath, Gautam, Singhal, Vikrant

Private Distribution Learning with Public Data: The View from Sample Compression

arXiv.org Artificial IntelligenceAug-14-2023

We study the problem of private distribution learning with access to public data. In this setup, which we refer to as public-private learning, the learner is given public and private samples drawn from an unknown distribution $p$ belonging to a class $\mathcal Q$, with the goal of outputting an estimate of $p$ while adhering to privacy constraints (here, pure differential privacy) only with respect to the private samples. We show that the public-private learnability of a class $\mathcal Q$ is connected to the existence of a sample compression scheme for $\mathcal Q$, as well as to an intermediate notion we refer to as list learning. Leveraging this connection: (1) approximately recovers previous results on Gaussians over $\mathbb R^d$; and (2) leads to new ones, including sample complexity upper bounds for arbitrary $k$-mixtures of Gaussians over $\mathbb R^d$, results for agnostic and distribution-shift resistant learners, as well as closure properties for public-private learnability under taking mixtures and products of distributions. Finally, via the connection to list learning, we show that for Gaussians in $\mathbb R^d$, at least $d$ public samples are necessary for private learnability, which is close to the known upper bound of $d+1$ public samples.

artificial intelligence, gaussian, machine learning, (14 more...)

2308.06239

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > District of Columbia > Washington (0.04)
North America > United States > Utah > Salt Lake County > Salt Lake City (0.04)
(3 more...)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (0.47)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

arXiv.org Artificial IntelligenceJun-19-2023

On the Robustness of Dataset Inference

Szyller, Sebastian, Zhang, Rui, Liu, Jian, Asokan, N.

Machine learning (ML) models are costly to train as they can require a significant amount of data, computational resources and technical expertise. Thus, they constitute valuable intellectual property that needs protection from adversaries wanting to steal them. Ownership verification techniques allow the victims of model stealing attacks to demonstrate that a suspect model was in fact stolen from theirs. Although a number of ownership verification techniques based on watermarking or fingerprinting have been proposed, most of them fall short either in terms of security guarantees (well-equipped adversaries can evade verification) or computational cost. A fingerprinting technique, Dataset Inference (DI), has been shown to offer better robustness and efficiency than prior methods. The authors of DI provided a correctness proof for linear (suspect) models. However, in a subspace of the same setting, we prove that DI suffers from high false positives (FPs) -- it can incorrectly identify an independent model trained with non-overlapping data from the same distribution as stolen. We further prove that DI also triggers FPs in realistic, non-linear suspect models. We then confirm empirically that DI in the black-box setting leads to FPs, with high confidence. Second, we show that DI also suffers from false negatives (FNs) -- an adversary can fool DI (at the cost of incurring some accuracy loss) by regularising a stolen model's decision boundaries using adversarial training, thereby leading to an FN. To this end, we demonstrate that black-box DI fails to identify a model adversarially trained from a stolen dataset -- the setting where DI is the hardest to evade. Finally, we discuss the implications of our findings, the viability of fingerprinting-based ownership verification in general, and suggest directions for future work.

artificial intelligence, machine learning research, verification, (17 more...)

2210.13631

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
North America > United States > Maryland > Baltimore (0.04)
Asia > China > Zhejiang Province > Hangzhou (0.04)

Genre: Research Report > New Finding (0.87)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.66)

arXiv.org Machine LearningApr-22-2020

Private Query Release Assisted by Public Data

Bassily, Raef, Cheu, Albert, Moran, Shay, Nikolov, Aleksandar, Ullman, Jonathan, Wu, Zhiwei Steven

We study the problem of differentially private query release assisted by access to public data. In this problem, the goal is to answer a large class $\mathcal{H}$ of statistical queries with error no more than $\alpha$ using a combination of public and private samples. The algorithm is required to satisfy differential privacy only with respect to the private samples. We study the limits of this task in terms of the private and public sample complexities. First, we show that we can solve the problem for any query class $\mathcal{H}$ of finite VC-dimension using only $d/\alpha$ public samples and $\sqrt{p}d^{3/2}/\alpha^2$ private samples, where $d$ and $p$ are the VC-dimension and dual VC-dimension of $\mathcal{H}$, respectively. In comparison, with only private samples, this problem cannot be solved even for simple query classes with VC-dimension one, and without any private samples, a larger public sample of size $d/\alpha^2$ is needed. Next, we give sample complexity lower bounds that exhibit tight dependence on $p$ and $\alpha$. For the class of decision stumps, we give a lower bound of $\sqrt{p}/\alpha$ on the private sample complexity whenever the public sample size is less than $1/\alpha^2$. Given our upper bounds, this shows that the dependence on $\sqrt{p}$ is necessary in the private sample complexity. We also give a lower bound of $1/\alpha$ on the public sample complexity for a broad family of query classes, which by our upper bound, is tight in $\alpha$.

algorithm, complexity, sample complexity, (14 more...)

arXiv.org Machine Learning

2004.10941

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > Ohio (0.04)
North America > United States > Minnesota (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)

Genre: Research Report (0.50)

Industry: Information Technology > Security & Privacy (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Computational Learning Theory (1.00)

Bassily, Raef, Nandi, Anupama

Privately Answering Classification Queries in the Agnostic PAC Model

arXiv.org Machine LearningJul-31-2019

We revisit the problem of differentially private release of classification queries. In this problem, the goal is to design an algorithm that can accurately answer a sequence of classification queries based on a private training set while ensuring differential privacy. We formally study this problem in the agnostic PAC model and derive a new upper bound on the private sample complexity. Our results improve over those obtained in a recent work [BTT18] for the agnostic PAC setting. In particular, we give an improved construction that yields a tighter upper bound on the sample complexity. Moreover, unlike [BTT18], our accuracy guarantee does not involve any blow-up in the approximation error associated with the given hypothesis class. Given any hypothesis class with VC-dimension $d$, we show that our construction can privately answer up to $m$ classification queries with average excess error $\alpha$ using a private sample of size $\approx \frac{d}{\alpha^2}\max\left(1, \sqrt{m}\alpha^{3/2}\right)$. Using recent results on private learning with auxiliary public data, we extend our construction to show that one can privately answer any number of classification queries with average excess error $\alpha$ using a private sample of size $\approx \frac{d}{\alpha^2}\max\left(1, \sqrt{d} \alpha\right)$. Our results imply that when $\alpha$ is sufficiently small (high-accuracy regime), the private sample size is essentially the same as the non-private sample complexity of agnostic PAC learning.

algorithm, artificial intelligence, machine learning, (17 more...)

arXiv.org Machine Learning

1907.13553

Genre: Research Report (0.90)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Computational Learning Theory (0.68)